We adopt a top-down approach to simulate hierarchical networks, considering various simulation parameters such as graph sparsity, noise, and the architecture of the super-level graph(s), including small-world, scale-free, and random graph networks (Watts and Strogatz 1998; Barabási and Bonabeau 2003).
Our simulations focus on basic hierarchies comprising one or two hierarchical layers. Two-layer networks mirror classical community detection on graphs, where our aim is to recover the true community labels from a given graph. Meanwhile, three-layer networks present a more intricate scenario, where the bottom layer of the hierarchy contains two levels of community structure. Here, the top level corresponds to the nodes at the uppermost layer of the hierarchy, and the middle level consists of communities nested within the top-level communities. The objective with these networks is to identify both sets of community partitions.
In each hierarchy, for fully connected networks, we initiate by simulating \(N_{\text{top}}\) top-level nodes, adhering to a directed small-world, random graph, or scale-free network architecture (Watts and Strogatz 1998; Barabási and Bonabeau 2003). In cases where the network is disconnected, we simply simulate \(N_{\text{top}}\) disconnected nodes. For networks with three hierarchical layers, we then generate a subnetwork of \(N_{\text{middle}}\) nodes from each top-layer node, adhering to the network structure utilized at the top level. If the network is fully connected, we apply a probability \(p_\text{between}\) to the nodes from different top-level communities being connected.
The final step in all hierarchies is to generate the nodes in the observed (bottom) layer of the hierarchy. For each top-layer or middle-layer node, we generate a sub-network of \(N_{\text{bottom}}\) nodes under the same sub-network structure as the previous layers, and we apply a probability \(p_\text{between}\) for nodes from different communities to share an edge.
Once we simulate a hierarchical graph, we utilize this hierarchy to generate the node-feature matrix, which depicts the expression of \(N\) genes across \(d\) samples. Here, \(N\) denotes the number of nodes in the observed (bottom) layer of the hierarchy.
We simulate the node-feature matrix using the topological order of the observed level graph. We start by generating the features of nodes that have no parental input. We refer to these nodes as origin nodes. All origin nodes are simulated from a normal distribution with mean \(0\) and standard deviation \(\sigma\). All other nodes are simulated from a normal distribution centered at the mean of their parent nodes and with standard deviation \(\sigma\).
Our HCD method consists of two modules:
A graph autoencoder based on the architecture proposed by Salehi and Davulcu (2019) which utilizes graph attention layers such as those first introduced by Veličković et al. (2017) (See most recent version of pseudocode for details). In our applications, we incorporate multi-head attention in all encoder and decoder layers to expand model learning capacity. The graph autoencoder module takes a set of node attributes \({\bf X}\in\mathbb{R}^{N \times d}\) and an adjacency matrix \({\bf A}_{ij}\in[0,1]\) defining the relationships between the node as input and learns a low dimensional embedding \({\bf Z} \in \mathbb{R}^{N \times q}\) of the network and attributes. This embedding is then used to reconstruct the node attributes and adjacency matrix under a separate loss function for each.
The second module of HCD takes the embeddings \({\bf Z}\) generated by the autoencoder and applies a multilevel community detection process. This module first applies the function \(f_{top}\) which partitions the data into \(k\) groups. The function \(f_{top}\) can be either (i) the \(SoftKMeans\) layer described by (Falkner 2022) or a neural layer such as
Simple Linear Layer:
\[{\bf P} = f_{top}({\bf Z}) = \text{Softmax}(\theta({\bf ZW} + {\bf b}))\] The above example is a simple fully connected layer where \({\bf W} \in \mathbb{R}^{N \times k}\) is a matrix of parameters, \({\bf b} \in \mathbb{R}^N\) is a bias parameter vector and \(\theta\) is an activation function such as the \(ReLU\) function.
Alternative types of layer can also be used such heuristic KMeans or other GNN-type layers such as GraphSAGE convolution [@].
Using gradient optimized SoftKMeans
\[{\bf P} = f_{top}({\bf Z}) = \text{Softmax}(g({\bf Z})) \] where \(g()\) represents the KMeans algorithm (Falkner 2022)
GraphSAGE Convolutional Layer
\[{\bf p}_v = f_{top}({\bf z}_v) = \text{Softmax}\left(\theta\left( W · CONCAT\big({\bf z}_v,\ AGGREGATE\big({\bf z}_u : u \in N(v) \big) \big) \right)\right)\] where \({\bf p}_v\) is the soft assignment of node \(v\)
Here we describe the settings for three different sets of simulations. For each set of simulations, we simulated hierarchical gene networks consisting of 5 top level nodes/communities, 15 middle level nodes/communities and 300 nodes/genes at the observed level of the hierarchy. We simulate networks for all three graph types (small world, scale free or random graph) with the nodes at the top level of the hierarchy either all connected or all disconnected. We also simulated nodes under two levels of noise with noise set to either \(\sigma = 0.1\) or \(\sigma = 0.5\). Additionally, we sample the probability for edges within and between node communities at the middle layer and bottom level from the uniform distributions
Each combination of parameters (graph type, connectivity, and noise) is replicated multiple times.
To evaluate the performance of HCD, we conducted a series of simulation studies under various conditions. Below, we describe each simulation condition in detail:
These simulation conditions are designed to rigorously assess the performance of HCD under diverse scenarios. By varying the partitioning methods, output sizes, and use of prior information, we aim to gain insights into HCD’s robustness, adaptability, and effectiveness in uncovering hierarchical community structures in genomic data.
We evaluate the performance of our HCD method using three graph-based clustering metrics:
homogeneity evaluates the degree to which each predicted community contains only data points from a single true community, indicating how well the algorithm avoids mixing different groups. Thus, homogeneity tends to be high if resolved communities contain only members of the same true community.
completeness assesses the extent to which all data points that belong to the same true community are correctly assigned to a single predicted community. Thus completeness is always high if all members of the same true communities end up in the same resolved community even if several true communities are allocated together.
NMI Normalized Mutual Information (NMI) is a weighted average of the Completeness and Homogeneity two metrics.
ARI The Adjusted Rand Index (ARI) is a metric that measures the similarity between two different clusterings of the same data, correcting for the chance of random agreement. It ranges from -1 to 1, where 1 indicates perfect agreement between the clusterings, 0 represents random labeling, and negative values indicate less agreement than expected by chance.
In all simulations we compare HCD with two commonly used heuristic methods:
Louvain Method: This widely used community detection algorithm optimizes modularity by iteratively reassigning nodes between communities and merging communities. It automatically determines the number of communities and is highly efficient for processing large networks. In all simulations, we apply the Louvain method to the estimated adjacency matrix, which is derived from the correlation matrix of the node features. The resulting community predictions are compared to the ground truth for both the top and middle layers of the simulated hierarchy.
Hierarchical Clustering using Ward’s Distance (HC): This agglomerative clustering approach iteratively merges clusters to minimize the increase in within-cluster variance. Ward’s method tends to produce compact, spherical clusters and generates a dendrogram representing the full hierarchical structure of the data. In all simulations, HC is applied to the simulated gene expression data (node feature matrix). The optimal community predictions are determined by identifying the best cutting point on the dendrogram, aligning with the ground truth clusters (five clusters and 15 clusters).
There are several classical approaches to estimating the number of communities, \(k\), in unsupervised learning. Among these, two commonly adopted data-driven approaches are the Elbow Method and the optimization of a cluster quality metric, such as the Silhouette Score. Additionally, we also consider a graph-based approach known as the Bethe Hessian estimate.
Elbow Method:
The Elbow Method identifies the optimal number of clusters by plotting
the clustering cost (e.g., within-cluster sum of squares) against the
number of clusters, \(k\). The “elbow”
point, where the rate of decrease in the cost slows significantly, is
taken as the optimal \(k\). This method
assumes that beyond this point, adding more clusters does not improve
clustering quality significantly.
Silhouette Method:
The Silhouette Method evaluates the quality of clustering by measuring
how similar data points are to their assigned cluster (cohesion)
compared to other clusters (separation). The Silhouette score ranges
from -1 to 1, where higher values indicate better-defined clusters. The
optimal \(k\) is the value that
maximizes the average Silhouette score across all data points.
Bethe Hessian:
The Bethe Hessian is a graph-based method for estimating the number of
communities in a network. It involves constructing the Bethe Hessian
matrix \({\bf B}_\eta = (\eta^2 - 1){\bf I} +
{\bf D} - \eta {\bf A}\), where \(A\) is the adjacency matrix, \(D\) is the degree matrix, and \(\eta\) is a regularization parameter. The
number of negative eigenvalues of \(H(r)\) provides an estimate of the number
of communities. This approach leverages spectral properties of the graph
to identify structural groupings. According to [@]
The Hierarchical Community Detection (HCD) algorithm performs poorly on the top layer of the hierarchy, despite the clear community structure at this level especially in the case of the disconnected networks. This observation warrents further investigation into the model dynamics to determine why the community structure clearly reflected in embeddings is not also captured by the loss function.
Using the example network shown above, we debugged each stage of training HCD investigating both the loss calculation and visually determining whether the computed probabilities and intermediate steps accurately reflect community structure captured in embeddings produced by the autoencoder GATE.
The clustering loss for the \(\ell^{th}\) hierarchical layer is computed as:
\[{\bf D}_\ell = {\bf X}^T - {\bf M}_\ell {\bf P}_\ell^T \]
\[ L_{C_\ell} = tr\left(\sqrt{\text{diag}({\bf D}_\ell^T{\bf D}_\ell)}\right) \]
Where \({\bf D}_\ell \in \mathbb{R}^{n \times N_\ell}\), \({\bf P}_\ell \in \mathbb{R}^{N \times k_\ell}\), and
\[ {\bf M}_\ell = {\bf X}^T {\bf P}_\ell \cdot \left({\bf 1}_N^T {\bf P}_\ell\right)^{-1} \in \mathbb{R}^{n \times k_\ell} \]
are the centroids for the community predictions in the \(\ell^{th}\) hierarchical layer.
In general, this calculation seems consistent with the results of the model. The issue appears to be that the steps converting the embeddings to logits and eventually probabilities via softmax are getting distorted.
Recall that HCD consists of two modules: - 1. An autoencoder which produces a an embedding \(\bf Z\) - 2. A prediction model which converts the embedding into logits which capture model structure and then converts these logits into pseudo-probabilities vias the softmax activation function. This is the standard deep learning procedure for multiclass classification.
The previous framework for the prediction model is described in detail mathematically below:
The embeddings \({\bf M}\) were computed as
\[ {\bf M} = \text{LeakyReLU}({\bf Z} {\bf W}_{out} + \beta_{out}) \]
where \({\bf Z}\in\mathbb{R}^{N \times q}\) is the input embedding from GATE, \({\bf M}\in\mathbb{R}^{N \times k}\) is the output embedding, and \({\bf W}\in\mathbb{R}^{q \times k}\) is a matrix of learnable parameters.
The embedding \({\bf M}\) is then normalized using an element-wise affine:
\[ {\bf H} = \text{LayerNorm}({\bf M}) = \frac{{\bf M}_{ij} - \mathbb{E}[{\bf M}_i]}{\sqrt{\text{Var}[{\bf M}_i] + \epsilon}}\cdot \gamma + \beta \]
The logits of the normalized embedding \({\bf H}\) are then converted to probabilities via softmax. However, a dropout layer was mistakenly applied to \({\bf H}\) before the softmax activation. While this dropout layer was intended as part of the original architecture, its placement was incorrect, as channels should not be zeroed out before the final activation of the output layer in any architecture.
\[{\bf P} = \text{Softmax}({\bf H}) \]
This framework has two distinct problems:
See the example in Figure 2 below which shows the embeddings at the first and last training epoch under this framework:
After inspecting the embeddings at each stage of this computation, it was observed that activating the embedding with LeakyReLU before converting logits distorted the signals in the data, essentially washing out fine-grained community structure. This led to artifact community signals in the top layer of the hierarchy. Instead, we seek to address this fundamental issue in the prediction model using the following revised framework:
A new intermediate embedding {} is computed
\[ {\bf M} = {\bf Z} {\bf W}_{1} + \beta_1 \]
\[{\bf M}_{norm} = \text{LayerNorm}({\bf M}) \]
where \({\bf M}\in\mathbb{R}^{N \times q}\) is an intermediate output embedding, and \({\bf W}\in\mathbb{R}^{q \times q}\) is a matrix of learnable parameters that casts \(\bf Z\) to the same dimension. Again, \({\bf M}\) is normalized using an element-wise affine:
\[ H = \text{LeakyReLU}({\bf M}_{norm}) \]
\(\bf H\) is then converted to the
output embedding using a single fully connected layer
\[ {\bf O} = {\bf H}{\bf W}_{out} + \beta_{out} \]
\[{\bf P} = \text{Softmax}({\bf O}) \]
This revised framework eliminates the activation and normalization of embeddings before computing logits, removes inappropriate layer dropout, and ensures that the output embedding effectively shifts the logits to the appropriate embedding space. These adjustments preserve fine-grained and hierarchical community signals through three key changes:
#Both-Linear 5, 15
#SoftKMeans-Linear-5,15
#Both-Linear-BH-64